
Generating Images


Generating Images with Multimodal Language Models

Neural Information Processing Systems

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs.
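The abstract does not spell out the mapper architecture, but the idea translates naturally into code. Below is a minimal, hypothetical sketch of such a mapping network: learned query vectors attend over the frozen LLM's hidden states and are projected into the conditioning space of an off-the-shelf text-to-image model. All dimensions here (4096-dim LLM states, 768-dim conditioning vectors, 77 output tokens) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TextToImageMapper(nn.Module):
    """Sketch of a mapping network that projects frozen-LLM hidden
    states into the embedding space of a text-to-image model.
    Dimensions are illustrative, not the paper's exact setup."""

    def __init__(self, llm_dim: int = 4096, t2i_dim: int = 768,
                 num_queries: int = 77):
        super().__init__()
        # Learned queries attend over the LLM's hidden states and emit
        # a fixed-length sequence the image decoder can condition on.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8,
                                          batch_first=True)
        self.proj = nn.Linear(llm_dim, t2i_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, llm_dim), from the frozen LLM.
        batch = llm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, llm_hidden, llm_hidden)
        return self.proj(out)  # (batch, num_queries, t2i_dim)
```

In this setup only the mapper's parameters would be trained; the LLM and the image generator stay frozen, consistent with the approach the abstract describes.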


Generating Images with Perceptual Similarity Metrics based on Deep Networks

Neural Information Processing Systems

We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that allow generating sharp, high-resolution images from compressed abstract representations. Instead of computing distances in image space, we compute distances between image features extracted by deep neural networks. This metric reflects the perceptual similarity of images much better and thus leads to better results. We demonstrate two use cases of the proposed loss: (1) networks that invert the AlexNet convolutional network; (2) a modified variational autoencoder that generates realistic high-resolution random images.
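As a rough illustration of the core idea (not the paper's full loss, which also combines image-space and adversarial terms), a feature-space distance can be computed against a frozen pretrained network; here torchvision's AlexNet conv stack stands in as the comparator:

```python
import torch
import torch.nn.functional as F
from torchvision.models import alexnet, AlexNet_Weights

# Frozen feature extractor; the conv-stack output serves as the
# comparison space. Gradients still flow back to the generated image.
feat_net = alexnet(weights=AlexNet_Weights.DEFAULT).features.eval()
for p in feat_net.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated: torch.Tensor,
                    target: torch.Tensor) -> torch.Tensor:
    """Distance in deep-feature space rather than pixel space."""
    return F.mse_loss(feat_net(generated), feat_net(target))
```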


Evaluating and comparing gender bias across four text-to-image models

Hammad, Zoya, Sowah, Nii Longdon

arXiv.org Artificial Intelligence

SUMMARY: As we increasingly use Artificial Intelligence (AI) in decision-making for industries like healthcare, finance, e-commerce, and even entertainment, it is crucial also to reflect on the ethical aspects of AI, for example the inclusivity and fairness of the information it provides. In this work, we evaluated different text-to-image AI models and compared the degree of gender bias they present. The evaluated models were Stable Diffusion XL (SDXL), Stable Diffusion Cascade (SC), DALL-E, and Emu. We hypothesized that DALL-E and Stable Diffusion, which are comparatively older models, would exhibit a noticeable degree of gender bias towards men, while Emu, which was recently released by Meta AI, would produce more balanced results. As hypothesized, we found that both Stable Diffusion models exhibit a noticeable degree of gender bias, while Emu demonstrated more balanced results (i.e., less gender bias). Interestingly, however, OpenAI's DALL-E exhibited almost the opposite pattern: the ratio of women to men was significantly higher in most cases tested. Although we still observed a bias here, it favored women over men. This may be explained by the fact that OpenAI changed the prompts at its backend, as observed during our experiment. We also observed that Meta AI's Emu utilized user information while generating images via WhatsApp. Finally, we proposed some potential solutions to avoid such biases, including ensuring diversity across AI research teams and having diverse datasets.

INTRODUCTION: Artificial Intelligence (AI) has been growing remarkably in recent years, impacting numerous aspects of our daily lives. One such area of significant advancement is text-to-image generation.
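For a sense of how such a comparison might be tallied, here is a hypothetical sketch: generations are annotated with a perceived-gender label (by hand or by a classifier) and a per-model women-to-men ratio is computed. The counts below are invented for illustration and are not the paper's data.

```python
from collections import Counter

def gender_ratio(labels: list[str]) -> float:
    """Women-to-men ratio over a batch of annotated generations;
    labels are 'woman' or 'man'. Values above 1.0 mean more women."""
    counts = Counter(labels)
    return counts["woman"] / max(counts["man"], 1)

# Hypothetical annotations for one occupation prompt across models:
observed = {
    "SDXL": ["man"] * 8 + ["woman"] * 2,
    "DALL-E": ["woman"] * 7 + ["man"] * 3,
}
for model, labels in observed.items():
    print(model, round(gender_ratio(labels), 2))
```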


Tackling fake images in cybersecurity -- Interpretation of a StyleGAN and lifting its black-box

Laubmann, Julia, Reschke, Johannes

arXiv.org Artificial Intelligence

In today's digital age, concerns about the dangers of AI-generated images are increasingly common. One powerful tool in this domain is StyleGAN (style-based generative adversarial network), a generative adversarial network capable of producing highly realistic synthetic faces. To gain a deeper understanding of how such a model operates, this work focuses on analyzing the inner workings of StyleGAN's generator component. Key architectural elements and techniques, such as the Equalized Learning Rate, are explored in detail to shed light on the model's behavior. A StyleGAN model is trained using the PyTorch framework, enabling direct inspection of its learned weights. Through pruning, it is revealed that a significant number of these weights can be removed without drastically affecting the output, leading to reduced computational requirements. Moreover, the role of the latent vector, which heavily influences the appearance of the generated faces, is closely examined. Global alterations to this vector primarily affect aspects like color tones, while targeted changes to individual dimensions allow precise manipulation of specific facial features. This ability to fine-tune visual traits is not only of academic interest but also highlights a serious ethical concern: the potential misuse of such technology. Malicious actors could exploit this capability to fabricate convincing fake identities, posing significant risks in the context of digital deception and cybercrime. In today's modern age, models to generate human faces are based on StyleGANs, which have been trained on the FFHQ (Flickr-Faces-HQ) image dataset [1].
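The abstract does not give the exact pruning procedure, but a generic magnitude-pruning sketch over a PyTorch model, plus a targeted latent-dimension edit, conveys the two experiments it describes. The layer types, the 30% pruning fraction, the 512-dim latent, and the edited dimension index are all illustrative assumptions, not the paper's settings.

```python
import torch

def prune_by_magnitude(model: torch.nn.Module,
                       fraction: float = 0.3) -> None:
    """Zero out the smallest-magnitude weights across all linear and
    conv layers: a crude global-threshold sketch of weight pruning."""
    all_weights = torch.cat([m.weight.detach().abs().flatten()
                             for m in model.modules()
                             if isinstance(m, (torch.nn.Linear,
                                               torch.nn.Conv2d))])
    threshold = torch.quantile(all_weights, fraction)
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
                m.weight.mul_((m.weight.abs() >= threshold).float())

# Targeted latent edit: perturb a single dimension of the latent vector
# (512 is StyleGAN's typical latent size; index 42 chosen arbitrarily).
z = torch.randn(1, 512)
z_edited = z.clone()
z_edited[0, 42] += 3.0
```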


Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator

Kang, Minjae, Brandão, Martim

arXiv.org Artificial Intelligence

Recent audio-visual generative models have made substantial progress in generating images from audio. However, existing approaches focus on generating images from single-class audio and fail to generate images from mixed audio. To address this, we propose an Audio-Visual Generation and Separation model (AV-GAS) for generating images from soundscapes (mixed audio containing multiple classes). Our contribution is threefold: First, we propose a new challenge in the audio-visual generation task, which is to generate an image given a multi-class audio input, and we propose a method that solves this task using an audio-visual separator. Second, we introduce a new audio-visual separation task, which involves generating separate images for each class present in a mixed audio input. Lastly, we propose new evaluation metrics for the audio-visual generation task: Class Representation Score (CRS) and a modified R@K. Our model is trained and evaluated on the VGGSound dataset. We show that our method outperforms the state-of-the-art, achieving 7% higher CRS and 4% higher R@2* in generating plausible images with mixed audio.
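The modified R@K (denoted R@2* above) is specific to the paper, but the standard R@K it builds on is easy to state. The sketch below assumes a similarity matrix with ground-truth matches on the diagonal; it is not the paper's modified metric.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 2) -> float:
    """Standard R@K: fraction of queries whose ground-truth item ranks
    in the top k by similarity. sim[i, j] scores query i against
    candidate j; ground truth for query i is candidate i."""
    ranks = (-sim).argsort(axis=1)          # descending similarity
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))
```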


Reviews: Generating Images with Perceptual Similarity Metrics based on Deep Networks

Neural Information Processing Systems

I think the most important contribution of the manuscript is to describe a method that substantially improves image reconstruction from compressed deep network representations. In that regard I would have liked an analysis of the compression rate for the reconstructions from the different feature spaces, in particular because the difference in quality between layers conv5 and fc6 does not seem too large, whereas there is roughly a 10-fold reduction in dimensionality in the feature representation (13 x 13 x 256 = 43,264 dimensions vs. 4,096). One factor that should definitely be discussed in the paper is that the adversarial prior appears to force the reconstructions to reuse image data from the training set. This is not necessarily a problem in terms of image compression, but it is an important factor: a careful choice of training data might be important depending on what type of images one wants to compress.


How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias

Fadahunsi, Tosin, d'Aloisio, Giordano, Di Marco, Antinisca, Sarro, Federica

arXiv.org Artificial Intelligence

Generative models are nowadays widely used to generate graphical content for multiple purposes, e.g., the web, art, and advertising. However, it has been shown that the images generated by these models can reinforce societal biases that already exist in specific contexts. In this paper, we focus on understanding whether this is the case when one generates images related to various software engineering tasks. In fact, the Software Engineering (SE) community is not immune to gender and ethnicity disparities, which could be amplified by the use of these models. Hence, if used without awareness, artificially generated images could reinforce these biases in the SE domain. Specifically, we perform an extensive empirical evaluation of the gender and ethnicity bias exposed by three versions of the Stable Diffusion (SD) model (a very popular open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We obtain 6,720 images by feeding each model with two sets of prompts describing different software-related tasks: one set includes the Software Engineer keyword, and one set does not include any specification of the person performing the task. Next, we evaluate the gender and ethnicity disparities in the generated images. Results show that all models are significantly biased towards male figures when representing software engineers. On the contrary, while SD 2 and SD XL are strongly biased towards White figures, SD 3 is slightly more biased towards Asian figures. Nevertheless, all models significantly under-represent Black and Arab figures, regardless of the prompt style used. The results of our analysis highlight severe concerns about adopting those models to generate content for SE tasks and open the field for future research on bias mitigation in this context.
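To make the two-set prompt design concrete, here is a hypothetical sketch of how the prompt grid and a simple disparity score might look; the task list, templates, and counts are invented for illustration and do not reproduce the paper's 6,720-image protocol.

```python
# Two prompt sets mirroring the paper's design: one naming the role,
# one leaving the person unspecified (tasks here are invented).
tasks = ["writing code", "reviewing a pull request",
         "debugging a program"]
prompts_with_keyword = [f"A Software Engineer {t}" for t in tasks]
prompts_without_keyword = [f"A person {t}" for t in tasks]

def disparity(counts: dict[str, int]) -> float:
    """Share of the majority label among annotated generations;
    0.5 is balanced for a binary attribute."""
    return max(counts.values()) / sum(counts.values())

print(disparity({"male": 54, "female": 6}))  # illustrative counts -> 0.9
```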


Why Is AI So Bad at Generating Images of Kamala Harris?

WIRED

When Elon Musk shared an image showing Kamala Harris dressed as a "communist dictator" on X last week, it was quite obviously a fake, seeing as Harris is neither a communist nor, to the best of our knowledge, a Soviet cosplayer. And, as many observers noted, the woman in the photo, presumably generated by X's Grok tool, had only a passing resemblance to the vice president. "AI still is unable to accurately depict Kamala Harris," one X user wrote. "Grok put old Eva Longoria in a snazzy outfit and called it a day," another quipped, noting the similarity of the "dictator" pictured to the Desperate Housewives star. "AI just CANNOT replicate Kamala Harris," a third posted.